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Abstract 

*0 , Network service providers and customers are often concerned with aggregate performance measures 

' that span muhiple network paths. Unfortunately, forming such network-wide measures can be difficult, 

, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the 

number of endpoints to make exhaustive measurement practical. As a result, it is of interest to explore 
O , the feasibility of methods that dramatically reduce the number of paths measured in such situations 

■ while maintaining acceptable accuracy. 

We cast the problem as one of statistical prediction — in the spirit of the so-called 'kriging' problem 
in spatial statistics — and show that end-to-end network properties may be accurately predicted in many 
cases using a surprisingly small set of carefully chosen paths. More precisely, we formulate a general 
framework for the prediction problem, propose a class of linear predictors for standard quantities of 
interest (e.g., averages, totals, differences) and show that linear algebraic methods of subset selection 
<~i ■ may be used to effectively choose which paths to measure. We characterize the performance of the 

, resulting methods, both analytically and numerically. The success of our methods derives from the low 

effective rank of routing matrices as encountered in practice, which appears to be a new observation in 
its own right with potentially broad implications on network measurement generally. 
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I. Introduction 

In many situations it is important to obtain a network-wide view of path metrics, such as 
■ latency and packet loss rate. For example, in overlay networks regular measurement of path 
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properties is used to select alternate routes. At the IP level, path property measurements can be 
used to monitor network health, assess user experience, and choose between alternate providers, 
among other applications. Typical examples of systems performing such measurements include 
the NLANR AMP project, the RIPE Test-Traffic Project, and the Internet End-to-end Performance 
Monitoring project [1], [2], [3]. 

Unfortunately extending such efforts to large networks can be difficult, because the number of 
network paths grows as the square of the number of network endpoints. Initial work in this area 
^ . has found that it is possible to reduce the number of end-to-end measurements to the number of 
"virtual links" (identifiable link subsets) — which typically grows more slowly than the number 
of paths — and yet still recover the complete set of end-to-end path properties exactly [4], [5]. 

This quantity is a sharp lower limit that stems from a linear algebraic analysis of the rank of 
routing matrices. Measuring even one path fewer requires one to consider approximations instead 
of exact reconstructions. Specifically, one is faced with the task of measuring some paths and 
predicting the characteristics of others. The prediction of population characteristics from those 
of a sample is a classical problem in the statistical literature. The most well-known version of 
the predication problem is perhaps that which occurs in the spatial sciences, under the name of 
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kriging [6], where, for example, measurements are taken at series of spatially distributed wells 
to enable prediction of oil concentrations throughout the underlying substrate. 

In this paper, we develop a framework for what we term network kriging, the prediction of 
network path characteristics based on a small sample. Our methods exploit an observed tendency 
in real networks for sharing certain edges between many paths, i.e., a sort of "squeezing" of 
paths over these edges. We begin with a discussion of this sharing and the reduced effective 
rank of routing matrices in Section |lll followed by a development of our statistical framework 
and a path selection algorithm in Section |llll In Section |lVl we examine the performance of our 
methods using delay data from the backbone of the Abilene network. In Section |V| we conclude 
with a brief discussion. 



A. Background 

We begin by establishing some relevant notation and definitions. Let Q = (V, £) be a strongly 
connected directed graph, where the nodes in V represent network devices and the edges in £ 
represent links between those devices. Additionally, let V be the set of all paths in the network, 
and let n^, = |V|, ne = \£\, Up = \V\ denote respectively the number of devices, links and 
paths. Finally, let y E R"'' be the values of a metric measurable on all paths i E V, which is 
assumed to be a linear function of the values of the same metric on the edges j E £, expressed 
as a; e R"^ We are interested in particular in the case where Up S> rig and the linear relation 
between y and x is given hy y = Gx, where G E {0, is a routing matrix whose entries 

simply indicate the traversal of a given link by a given path via 



For example, if we let x denote the delay times for edges in the network and let y denote the 
delay times for paths in the network, then y = Gx. Additionally, the same relation holds for 
log(l — loss rate). 

As explained in Section HI our interest in this paper focuses on the problem of monitoring 
global network properties via measurements on some small subset of the paths. Note that the 
question of which paths to monitor is equivalent to the selection of an appropriate subset of rows 
in G, due to the relation y = Gx. Exploiting this insight, earlier work by Chen and colleagues [4] 
shows that in fact one can measure as few as k* ~ O(rie) paths and still recover exact knowledge 
of all network path behaviors.^ Their argument is essentially linear algebraic in nature, and is 
based upon the fact that a subset, say G, of only k* = Rank(G') independent rows of G are 
sufficient to span the range of G, i.e., to span the set {y E R"^ : y = Gx^x E R""}. As a result, 
given the measurements for paths corresponding to the rows of such a G, measurements for all 
other paths may be obtained as a function thereof. Similar work may be found in [8], in the 
context of Boolean algebras, for the problem of detecting link failures. 

'in [7] Chen and colleagues show that the number fc* of paths needed for their method scales at worst like 0(n„ lognu) in 
a collection of real and simulated networks and they argue that this behavior is to be expected in internet networks, due to the 
high degree of sharing between paths that traverse the dense core. 



II. Routing Matrices: Rank versus Effective Rank 




1 if path i traverses link j, 
otherwise. 



(1) 
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(a) Map of Abilene (b) Eigenspectrum of G 

Fig. 1. Map of the Abilene network and the eigenspectrum of one of its routing matrices. 



B. Reduced Rank 

Critical to the success of the methodology we propose in Section |III| is the concept of 
effective rank — the number of independent rows required to approximate a given matrix to 
a pre-determined level of tolerance. Effective rank is an important tool in numerical analysis 
and scientific computing {e.g., [9]), where it is often used to reduce the dimensionality of a 
linear system, generally with the goal of improving numerical stability. Here we use it to effect 
a significant reduction in measurement requirements above and beyond the levels achievable 
through the methodology in [4], [7]. In particular, we have found that routing matrices G have 
an effective rank much smaller than their actual rank, and as a result a surprisingly small number 
of rows are sufficient to adequately approximate the span of G. 



As an illustration, consider the Abilene network shown in Figure 1(a) , this is a high-performance 
network that serves Internet2 (the U.S. national research and education backbone). The network 
can be seen to consist of 11 nodes, but only 2 x 15 = 30 directed links. Accordingly, a large 
amount of sharing of these links can be expected between the 11x10 = 110 paths on the network. 
Such sharing would mean a great deal of similarity between paths and thus fewer "unique" paths 
to measure. Furthermore, similarities between paths would mean similarities between the rows 
of G which suggests that G may have an effective rank less than 30. 

A standard tool for assessing dimensionality and effective rank is the singular value decom- 
position (SVD) of G, which can be derived from an eigen- analysis of the matrix B = G^G. The 
eigenvalues of B {i.e., the squares of the singular values of G) are plotted in Figure 1(b) The 
large gap between the second and third eigenvalues, and the resulting knee in the spectrum, is 
evidence of a non-trivial amount of linear dependence among the rows of G. The effective rank 
of a matrix is determined by looking for a large gap in the spectrum that partitions the spectrum 
into large and small values. Thus the gap in Figure [T(E) suggests that as few as two paths may 
be sufficient to recover useful information about y in the Abilene network. 

Such strong spectral decay appears to be a common property of routing matrices. In Figure |2l 
we plot the spectra for five of the networks mapped by the Rocketfuel project [10]. The sharp 
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Fig. 2. Spectra of G for six networks mapped by the Rocketfuel project and Abilene. Note that the spectrum for each network 
has been re-scaled by the largest eigenvalue and the indices have been re-scaled to the unit interval. So on the horizontal axis, 
1 corresponds to the Rank(G)-th eigenvalue. 







Fig. 3. First four distinct eigenvectors oi B = G G for Abilene. Each link is drawn with a thickness that is roughly proportional 
to the magnitude of its corresponding eigenvector component. 



knee that occurs roughly 20% of the way through is evidence that the effective rank of these 
routing matrices is much smaller then their actual rank. Furthermore, a remarkable amount of 
similarity can be seen in the decay of the five spectra. 

To better appreciate the connection between this spectral behavior and network path properties, 
we turn to the eigenvectors, which here may be viewed as orthogonal vectors that capture 
independent "patterns" that occur among the paths in G. For example, the first eigenvector 
corresponds to the "direction" in link space that maximizes the path volume of y, i.e., Vi = 
argmax||a.||=i x^'^G^Ga; = argmaxn^^n^i y-^y. As can be seen in Figure |3l the energy of the first 
eigenvector for Abilene is concentrated along an east-west "path" across the northern part of the 
network, with the greatest concentration of energy occurring at the centrally located Indianapolis- 
Kansas City link. The subsequent eigenvectors can be seen to bring in successive refinements 
to this picture, with the second and third eigenvectors further emphasizing connections to and 
within the "path" indicated by the first, while with the fourth eigenvector we begin to see 
evidence of a second east-west "path" along the southern part of the network. See [11] for 
additional discussion. 



C. Connection to Betweenness 

The evidence in Figures [T(b) and |2l indicates that the effective rank of routing matrices in real 
networks can be noticeably lower than the actual rank. In the next section we will show how 
this phenomenon allows for substantial savings in measurement load for the particular problem 
of path monitoring that we have chosen to study. But we believe that the implications are in 
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fact much broader. This would suggest that the issue of reduced rank is a topic worth better 
understanding in and of itself. For example, from a practical perspective, it would be useful to 
understand how decisions in network design and route management affect the relative change 
from actual to effective rank. In this regard, connections between the spectral decay of G, on 
the one hand, and known metrics of topological structure, on the other hand, are likely to be 
useful. While a comprehensive study of this sort is beyond the scope of the present paper, we 
describe here a result establishing a connection with one such metric. 

Specifically, note that the plots in Figure |3] confirm our original intuitive notion that the 
effective rank of G bears an intimate connection with the disproportionate role played by some 
links over others in the routing of paths within the network. This observation suggests the 
relevance of the concept of betweenness centrality, a concept fundamental in the literature on 
social networks [12] and more recently being used in the study of complex networks in the 
statistical physics literature {e.g., [13], [14]). Essentially, betweenness centrality measures the 
number of paths that utilize a specified node, in the case of 'vertex centrality', or a specified 
link, in the case of 'edge centrality'.^ 

Note that the diagonal elements of the matrix B = G^G are precisely the number of paths 
routed over their respective links and hence a measure of the centrality of each link in the network. 
The off-diagonal elements measure the number of paths routed simultaneously over pairs of links, 
and might therefore be termed a measure of edge 'co-centrality' or 'co-betweenness'. The co- 
betweenness Bij of any two edges i and j will always be bounded above by the smaller of 
the two edge betweenness', i.e., Bij < min{i?j Bjj}. Hence, it might not be unreasonable to 
expect that the behavior of the eigen- spectrum of B may be related to that of its diagonal, as 
the following result shows. 

Proposition 2.1: Let B = G^G and, without loss of generality, assume that the edges have 
been ordered so that Bi i > ■ ■ ■ > Bn^^ne- Then, for = 1, . . . , n, we have < Bk^k diam(^). 



and for k > 1, 



^ < 1^ diam(^) , (2) 



where diam(^) is the diameter of the network graph Q. 

Proof of this result may be found in AppendixU The inequality in Q indicates that the spectral 
decay in G at worst parallels the decay of the edge betweenness on ^. In fact, examination of 
these quantities for the Rocketfuel datasets suggests that the decay in the can be noticeably 
faster. Nevertheless, Proposition 12.11 provides a connection through which recent and ongoing 
work on betweenness in complex networks (e.g., [13], [14]) may be found to have direct 
implications on the present context. 



III. Prediction of End-to-End Network Properties 

In this paper, we take as our monitoring goal the task of obtaining accurate (approximate) 
knowledge of a linear summary of network path conditions. That is, we seek to accurately 
predict a linear function of the path conditions y, of the form l^y where / G R"*", based on 
measurements, say ys E R'', of a subset of k paths. Two such linear summaries are the network- 
wide average, given by I'^y for /j = l/rip, and the difference between the averages over two 
groups of paths Vi and V2, given by k = l/\Vi\ for i E Vi and = —l/\V2\ for i E 7^2- The 

^The raw number of paths utilizing a node/edge is typically used under the assumption of unique shortest path routing; when 
multiple shortest paths exist, various methods of weighted counting have been proposed. See [12]. 
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prediction of I'^y from the k sampled path values in can be viewed as a particular instance of 
the classical problem of prediction in the statistical literature on sampling [15]. In this section, 
we (i) lay out our statistical framework, (ii) describe an accompanying path selection algorithm, 
and (iii) provide an analytical characterization of expected performance properties for our overall 
prediction methodology. 

A. Statistical prediction from sampled paths. 

We begin by building a model for the end-to-end properties in y. In the work that follows, 
it is only necessary that the first two moments of x and y be specified, as opposed to a full 
distributional specification. Let fi be the mean of x and let S be the covariance of x. Then the 
corresponding statistics for y are simply u = Gjj, and V = GTjG^ , respectively. 

Now fix A; < Rank(G). Let ys E denote the values yi^, ■ ■ ■ ,yi^ of the metric of interest 
for k paths ii, . . . ,ik E V that are to be sampled {i.e., measured), and let y,. G R"?^''' denote the 
values for those rip — k paths that remain. Similarly, let Gs be those rows of G corresponding to 
the k paths, zi, . . . , z^. and let Gr be the remaining rows. We have thus partitioned y and G into 
y = [y'^^y'^Y ^'^d G = and we may similarly re-express the mean and covariance 

of y as 



If the standard mean-squared prediction error (MSPE) is used to judge the quality of a 
predictor, i.e., if the quality of a predictor p^/s) is measured by MSPE(]9(?/s)) = E[{l^y—p{ys)Y], 
then the best predictor is known to be given by the conditional expectation E[l^y\y^ = Ijy^ + 
E[l'^yr\ys], where I = [Ij, /^]^ is partitioned in the same manner as y. But this predictor requires 
knowledge of the joint distributional structure. It is therefore common practice to restrict attention 
to a smaller and simpler subclass of predictors. A natural choice is the class of linear predictors, 
in which case, the best linear predictor (BLP) is given by the expression 

a^ys = fsVs + C<^r/i + - Gs/i), (4) 

where is any solution to c^Vgs = Vrs- However, without knowledge of p, the BLP in © is 
an ideal that cannot be computed. One natural solution is to estimate p from the data. Using 
generalized least squares, the mean can be estimated as p = [G^V^^^GsYG^V'^ys, where M~ 
denotes a generalized inverse of a matrix M. Substituting p for yU in Q produces an estimate 
of the BLP (an E-BLP) that is a function of only the measurements ys, the routing matrix G 
and the link covariance matrix S. Specifically, we obtain the predictor 

a^ys = Ijys + ljGr[GjV-/Gs]-G^V-,% = l^s + CK.KI'z/.- (5) 

The derivation of these and the other expressions above parallels that of similar linear prediction 
methods in spatial statistics — so-called 'kriging' methods — which motivates the name 'network 
kriging'. An example of the basic underlying argument, in the case of simple linear statistical 
models, can be found in [16, pp. 225-227]. In the case of the present context, the derivation 
requires only that Vss be invertible and that S be positive definite. 



Gs^i 

Grll 



and V 



Vss 


Vsr 




Vrs 







Gs^G'^ 

Gr^G^ 



s^y^r 

Gf^G^. 



(3) 
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B. Path Selection Algorithm 

The material in Section IIII-AI assumes a set of measurements from k paths ii, . . . ,ik E V. 
However, given the resources to measure any k paths in a network, we are still faced with 
the question of which k paths to measure. A natural response would be to choose k paths 
that minimize MSPE(a^?/s), over all subsets of k paths. Standard manipulations yield that this 
quantity has the form 

MSPE(a^y,) = C {Vrr - VrsV-'Vsr) Ir + C (KsKI'G, - G,) /i • (6) 
V ' V ' 

MSPE(al'ys) (Bias 0^5/^)2 

Of course, since we typically do not know fi, minimization of the full expression for MSPE(a^ys) 
is an unrealistic goal in practice. Instead, if adequate information on the covariance matrix S is 
available, one might consider trying to minimize MSPE(a^?/s)- A useful equivalent expression 
for this quantity is 

MSPE(a^l/,) = f,{GrC){I - Bs){GrCflr , (7) 

where C is a nonsingular matrix satisfying S = CC^, such as and Eg is the orthogonal 

projection matrix onto the span of the rows of GgG, i.e., onto Row(GsC). Since orthogonal 
projection matrices are idempotent and symmetric, the MSPE in ^ can be viewed as the square 
of the Euclidean norm of the projection of {GrG)^lr onto the complement of Row^GgC), i.e., 
onto Row(G',C)^ = Nul^G^C). 

In order to better appreciate the interpretation of dTJ, consider the special case of predicting 
a single unmeasured path {i.e., I = except for a single 1 in Ir), with S = /. The MSPE in © 
then simply measures the extent to which the corresponding row of G for this path lies outside 
of Row(Gs). Similarly, the more interesting case of a non-trivial / can be interpreted roughly 
as seeking a subset of k paths for whom the rows in Gs capture as many of the rows of G as 
possible to the largest extent possible. 

From the standpoint of optimization theory, our path- selection problem may be viewed as 
an example of the so-called 'subset selection' problem in computational linear algebra. In the 
case just described, and more generally for diagonal S, the selection of an appropriate subset 
of rows of GC has a meaningful physical interpretation, in terms of the selection of paths, and 
vice versa. Exact solutions to this problem are computationally infeasible (it is known to be 
NP-complete), but the problem is well-studied and an assortment of methods for calculating 
approximate solutions abound. 

The method we have used for the empirical work in this paper was adapted from the subset 
selection method described in Algorithm 12.1.1 of [9]. Essentially, our algorithm makes heuristic 
use of a QR-factorization with column pivoting to find k rows of G that approximate the span 
of the first k left singular vectors of GC. The left singular vectors form an orthonormal basis for 
the range of GC and the magnitude of their corresponding singular values indicates their relative 
importance. Note that these singular values are precisely the square-root of the eigenvalues of 
{GC)-'^{GC), i.e., the decaying spectrum from Section |lll See [17] for additional details. 

For a given choice of k, the overall complexity for the computation of the E-BLP in © is 
dominated by the computation of the SVD of GC, which is Oin^ne). This can likely be improved 
through the use of methods for sparse matrices, since the entries of GC tend to include a large 
fraction of zeros. The other components of the computation are the QR-factorization with column 
pivoting, which is 0{k'^np), and the computation of V^g^ which is only 0{k^). 
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C. Characterization of MSPE Properties 

Analytical arguments are useful for characterizing the expected performance of a predictor 
resulting from the combination of equation ^ with an arbitrary subset selection algorithm. For 
example, we have the following bound. 

Proposition 3.1: Denoting the i^^ row of G as G'(j) , let pi = \\G(i)\\l/\\G\\p , where ||-||2 and 
\\-\\f are the matrix 2-norm and the Frobenius matrix norm, respectively. Let Gs be a rescaled 



version of Gg , under the operation Gi 



'kpi , for each of the k rows in Gs- Then if 



(_T — yjr g\jr g 



</(^)l|G'|| 



for some /(A;), the MSPE can be bounded as 



MSPE(a^y, 



< 



1^112 



1) (A,+i + 2/(A;) IIGII^) ||/| 



2 • 



(8) 



(9) 



Proof of this result may be found in Appendix |lll The inequality in Q shows that the decay 
of the MSPE in k is controlled by two factors. The first factor, A^+i, quantifies how well G may 
be approximated by its first k singular dimensions, and will be small when k is no smaller than 
the effective rank of G. The second factor, /(A;), quantifies the ability of the underlying subset 
selection algorithm to approximate G by a matrix Gg formed from k of its rows. Our empirical 
experience indicates that in practice Afe+i can be much smaller than the term involving /(A;), 
which suggests that the rate of decay is dominated by /(fc). 



To get an idea of the behavior of f{k) for real networks, we computed 
for the Abilene backbone and five networks mapped by the Rocketfuel project, for the subset 
selection algorithm described in Section IIII-BI As can be seen in Figure HI there is a strong 



/l|G||^ 



power-law decay in 



with an exponent that ranges from —0.49 to 



—0.53. The consistency in this decay is quite remarkable, as it says that our subset selection 
algorithm finds a similarly good set of paths, for each k, in each of the networks. That this 
decay in j{k) indeed translates into good MSPE properties will be seen in Section HVl 

Proposition 13.11 holds for any given deterministic path selection algorithm. Perhaps surpris- 
ingly, it is possible to state a similar result for a randomized path selection algorithm. The 
empirical work we present in the next section clearly establishes the practical effectiveness of 
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our methodology using the deterministic algorithm described in Section ITlI-Bl However, effective 
randomized path selection algorithms are interesting to consider in the sense that such algorithms 
would be able to reduce the likelihood that small, highly localized events consistently remain 
outside the span of the sampled paths. The following result, the proof of which follows as a 
direct corollary of Theorem 3 in [18], suggests that an algorithm that randomly selects paths 
roughly in proportion to path length could indeed achieve similar performance characteristics. 

Proposition 3.2: Let Gg be a matrix constructed as in Proposition 13.11 but now from c paths 
randomly selected (with replacement) with respect to the probabilities {pi}. If the estimator d^ys 
in ^ is constructed based on the first (at most) k < c singular dimensions of Gg, then for any 
c <np and 5 > 0, 

MSPE{a^yg) < {Ml + l)(A,+i + 2(1 + VM275))c-^ \\G\\l) (10) 
holds with probability at least 1 — 5. 

IV. Empirical Validation of Prediction Methodology 

In this section, we show how our framework may be applied to address two practical problems 
of interest to network providers and customers. In particular, we show how the appropriate 
selection of small sets of path measurements can be used to (i) accurately estimate network- 
wide averages of path delays, and (ii) reliably detect network delay anomalies. We first describe 
the assembly of our dataset, and then present the two applications. 

A. Data: Abilene Path Delays 

Our methods are applicable to any per- link metric that adds to form per-path metrics. As a 
specific example, we consider delay. In order to validate our prediction methodology, we con- 
structed a full set of path-delay data for the Abilene network, using measurements obtained from 
the NLANR Active Measurement Project (AMP). This project continually performs traceroutes 
between all pairs of AMP monitors on ten minute intervals. Because most AMP monitors are on 
networks with Abilene connections, most traceroutes pass over Abilene. These data thus provide 
a highly detailed view of the state of the Abilene network. 

Beginning with a full set of measurements taken over 3 days in 2003, our approach to 
constructing per-path delays from this data consisted of (i) estimating per- link delays a;*^*^ G 
across the 30 network links, for each consecutive ten minute epoch t, and (ii) computing the per- 
path delays from these per-link delays, on the 110 Abilene paths, using the relation y*^*) = Gx^*\ 
where G is a fixed routing matrix corresponding to the routing on Abilene at the start of the 3 
day period. The end result is a temporally indexed sequence of path-delay vectors y*^*^ over 432 
successive epochs during the 3 day period. Note that the inferred link delays x*^*^ are not explicitly 
used by our prediction methodology, but rather were only necessary for the construction of the 
path delays, after which they were discarded. 

To construct the link delays, for each epoch t, we started with traceroutes between the 14,917 
pairs of AMP monitors for which complete data were available. Links comprising the Abilene 
network were identified by their known interface addresses. Since different traceroutes traverse 
each link at slightly different times, and since each traceroute takes up to three measurements 
per hop, we formed a single estimate of each link's delay by averaging across all the traceroutes 
that measured that link in the current epoch. This approach yields a single measure of delay 
for each link and epoch; while this measure does not capture the variations in delay that occur 
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Fig. 5. Mean relative prediction error as a function of k. 

within a ten minute interval, it provides a realistic and representative value for delay that is 
sufficient for our validation purposes. 

Recall that the statistical framework in Section IIII-AI involved only the first two moments 
of the link delays, the mean fi and the covariance S. While our methodology does not require 
knowledge of fi, since it is estimated at each epoch interval as part of the calculation of our 
predictor, we note here that the mean delays in our data for Abilene's 30 directed edges were 
fairly uniform over the interval from 2 to 36 milliseconds, with standard deviations that ran from 
0.16 to 0.94 over the full three-day period. Our methodology does, however, require knowledge 
of S, which in practice must be elicited from either historical data or possibly periodic, infrequent 
measurements on the links. For the purposes of the validation in this section, we used the per- 
link delays x*^*-* for one day's worth of data to obtain an estimate of S. The entries in this matrix 
were found to be primarily dominated by the diagonal elements, with only a small number 
of off-diagonal entries of similar magnitude. Inspection of the actual delay data suggested that 
the majority of the latter were due to artifacts of the measurement procedure. Therefore, in 
implementing our methodology we took S to be the corresponding diagonal matrix. See [17] for 
additional details, including evidence regarding the improvement gained over the choice S oc /. 

B. Monitoring a Network-Wide Average 

An average is perhaps the most basic network-wide quantity that one might be interested in 
monitoring. So as our first application, we consider the prediction of the average delay over all 
Up = 110 Abilene paths as a function of time t, i.e., /^y*^*) with li = l/up, i = 1, . . . , Up, and 
t = 1, . . . ,432. Using ©, we computed predictions of the network-wide average path delay 
during each epoch, for a choice of A; = 1, . . . , 30 measured paths. The paths were chosen using 
the algorithm described in Section IIII-BI To summarize the accuracy of our predictions, we 
calculated the average relative error for each k, where the average is taken all the 432 epochs. 
The results are shown in Figure |5l 

Recall that exact recovery of all paths delays y^^\ at a given epoch t, requires measurement 
of k* = Rank(G') paths, which in this case means k* = 30. In examining Figure |5l note 
that in comparison a relative error of roughly only 10% is achievable using only k = 7 path 
measurements. Increasing k further improves the accuracy of the prediction up until around A; = 9 
or 10, after which it basically levels out. Since it is roughly at this point that the spectra of the 
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Fig. 6. Predictions of network-wide average path delays, for various choices of k. 



(weighted) routing matrices level out as well, this suggests that our subset selection algorithm 
is indeed doing what we are asking of it, in that it is tracking the effective rank quite closely. 
Additional results of this sort, on a variety of simulated datasets, may be found in [11]. 

To get a better idea of how well the predictors performed, we can compare plots of the 
predictions against a plot of the actual mean delays, as shown in Figure |6l for A; = 3, 5, 7 and 9. 
Note that all of the predictions mirror the rise and fall of the actual network-wide delay quite 
closely — even for A; = 3 measured paths the correlation with the data time-series is p = 0.814. 
However, it is also clear that there is a downward bias in these predictions, and that this bias 
is increasingly prominent as k decreases. The source of this bias can be traced to a lack of 
information on links in the network that are traversed by none of the k measured paths. In fact, 
the generalized inverse used in our E-BLP in ^ simply estimates the corresponding values Xj 
on these links to be zero. Hence, as we reach a point where every link contributes to at least one 
measured path, as it does by roughly k = 10, the bias diminishes accordingly. Note, however, 
that the bias for each k in Figure |6l is fairly constant. This suggests that a small amount of 
additional measurement information could go a long way. 

We implemented a simple method of bias correction, that uses a one-time measurement of a 
sufficient set of paths for complete reconstruction of the link delays (in this case, 30 paths). Since 
it is a one-time-only measurement, it represents a minimal addition to the network measurement 
load. The bias of our prediction for the first epoch was then calculated and used to adjust the 
predictions in the other 431 epochs, which amounts to a simple shift upward of each curve in 
Figure Boxplots of the relative bias remaining after application of this procedure are shown 
in Figure 121 The predictions are now extremely accurate, usually being off by less than 0.3%, 
and almost always within 1% — even when as few as A; = 3 paths are measured. 

Before moving on, we note that this performance on whole networks extends to subnetworks. 
In particular, we have successfully used our methods to make reliable comparisons between 
sub-network delays in the context of multi-homing [17]. 

C. Anomaly Detection 

The application in Section IIV-BI evaluates our predictor by standard statistical summaries, 
in essence looking at the accuracy of the predictor at hitting an unknown target. But it is also 
important to evaluate the accuracy in terms of accomplishing higher-level tasks. One such higher- 
level task of importance is the detection of potentially anomalous events. 
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Fig. 7. Relative bias after bias correction. 



For the purposes of illustration with our Abilene delay data, we define an anomaly as a spike 
in the network-wide average path delay that deviates from the mean of the previous six values 
(i.e., one hour) by more than a prescribed amount. For example, the dots in Figure |9l indicate 
points at which the average path delay differs from the mean of the previous six epochs by more 
than three times their standard deviation. 

To predict when such anomalies occur, we look for spikes in the predicted average path delay, 
calculated as described in Section IIV-BI and using a user-defined threshold. It is interesting to 
examine the effect of choice of both k and this threshold parameter. Insight can be obtained 
by examining ROC (Receiver Operating Characteristic) curves such as those in Figure [H Such 
plots, showing the true positive rate against the false positive rate for different parameter values, 
are a common tool for establishing cutoff values for detection tests. Each curve in Figure [H is 
formed by taking one value for k and varying the threshold level. Examining these curves, one 
sees that for a given threshold, say la, the true positive rate increases with the sample size k 
while the false positive rate stays about the same. Working with a k = 9 prediction, we see that 
the upper-left comer of the ROC curve (the best trade-off between a low false positive rate and 
a high true positive rate) occurs at around 2a. 

In Figure |9l the results are shown for the case k = 9, with a threshold of 2a. Circles have 
been placed along the actual path delay time series at the epochs that were flagged as anomalies 
in the predicted time series. On the whole, this predictor is quite accurate. Most of the major 
spikes are flagged, resulting in a true positive rate of 81%, while the false positive rate is only 
8%. Furthermore, most of these false positives seem to occur at lesser spikes in the actual delay 
data. 

V. Discussion 

The identification of an inherent statistical prediction problem in the task of end-to-end 
network path monitoring — which we have dubbed 'network kriging' — is analogous in its potential 
impact with the identification of, say, traffic matrix estimation with tomography (i.e., 'network 
tomography'). In both cases, the identification serves as a critical pointer to an already established 
literature, through which methodology may be developed by leveraging various principles and 
tools. Here we have focused exclusively on one basic version of the network kriging problem, 
the prediction of linear metrics of path properties using linear modeling principles. However, 
there is ample room for work beyond this, including for example extensions to non-linear metrics 
and temporal prediction models. 
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Fig. 8. ROC curves for predicting 3a spikes. Tlie threshold used to predict the spikes is varied from la to 5a" in increments 
of 0.25(7. 



Fig. 9. Comparison of predicted and actual spikes. The real spikes are those that exceed 3 times the standard deviation of the 
previous 6 epochs. The predicted spikes are those epochs where the rank 9 prediction exceeds 2 times the standard deviation of 
its previous 6 epochs. 

We have successfully demonstrated the promise of our proposed methodology on data obtained 
from a real network. Nevertheless, there are various practical issues to be dealt with to optimize 
our framework for full-scale implementation. These include efficient strategies for dealing with 
monitor failures, link failures and routing changes. We expect that many of these may be 
addressed using tactics similar to those proposed in [7]. 

On a final note, we mention again the fundamental importance of the material in Section III-Bl 
on the prevalence of low effective rank of routing matrices, to the success of our methodology. 
Put simply, low effective rank allows for the possibility of effective network-wide monitoring 
with reduced measurement sets. While we have exploited this characteristic for the purpose of 
path monitoring, it should apply equally well when the goal involves selective monitoring of 
links. The driving factors responsible for the low effective rank are poorly understood and remain 
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an interesting open problem. 

Appendix I 
Proof of Proposition 12. 11 

Let Ai > ■ ■ ■ > A„ denote the eigenvalues of the matrix B and define the spectral radius of 
B = G^G as p{B) = max{|Ai|, . . . , |A„|} = Ai. Now partition the matrix B into B = 
where F is an m x m matrix with m < Ue whose diagonal elements are the m smallest 
betweennesses. Denote the eigenvalues of F by 6'i > ■ ■ ■ > 6*^ and, for convenience, let Oi = —oo 
for z > m and 0^ = oo iox i < 0. Cauchy's Interlace Theorem tells us that the m eigenvalues of 
F are upper bounds for the m smallest eigenvalues of B, respectively. 

For the moment, we focus on the case of 6i > A„_m+i- If we can bound 9i we will have 
a bound for A„_m+i- Define the i-th row sum of B as ri{B) = X]j=il-^«jl' the deleted 
row sum of B as r,;(i?) = X]i=il-^j jl- ^^^en follows, by way of a corollary to Gershgorin's 

Theorem[19], that the spectral radius of F is bounded by 

p{F) < max ri{F) < max ri{B) . (11) 

l<i<m n—m+l<i<n 

Each of these row sums ri(B) can be bounded in terms of the diagonal element Bi^i. To see 
this, note that each of the Bi^i paths that use edge i have a length of at most diam(^). So each 
path can contribute at most diam(^) to the row sum ri{B). This means that for a betweenness 
matrix B we have rj(i?) < Bii diam(^). Recalling that we have ordered the edges such that 
the diagonal elements B^ i are non-increasing, our bound for the spectral radius becomes 

p(F) < Bn-m+i,n-m+i diam{g) . (12) 

Which means that our bound for the eigenvalue of B is 

An-m+l < ^1 < rn-m+liB) < -Bn-m+l,n-m+l diam(^) . (13) 

Finally, note that we are free to choose m = 1, . . . ,n — 1. So for i = 2, . . . , n we have an upper 
bound that decays no worse than the betweenness Bi^i, namely, 

Xi < Bi^i diam(6;) . (14) 

The i = 1 case follows from Gershgorin applied directly to B. 

To establish the second bound, given in © of Proposition 12.11 note that for any unit vector 
u we have ||-Bm||2 < Ai. In particular, for u = ei = [1,0, .. . ,0]"^ we have ||i?ei||2 < Ai. Since 
Bei is the first column of B, we have ||-Bei||oo = Bi i. Thus Bi^i = ||i?ei||oo < ||-Bei||2 = Ai. 
Dividing the left and right-hand sides of (fT?l) by Ai and Bi i, respectively yields the desired 
bound. 

Appendix II 
Proof of Proposition s. 11 

Since MSPE(a'^y,) = - B,)G^l\\l+\\{I - B,)Gml < (Ml+l) W-Bs)Gm 

we need only show || (/ — _Bs)G^||2 < A^+i -|- 2/(/c)||G|||,. To do so, we proceed as in the proof 
of Theorem 3 in [18]. Note that for any vector x E R" with ||a;|| = 1 we can write x = ay + bz. 
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where y E Ss = Row(Gs), z G S^, a, 6 G R, and + 6^ = 1. Using this decomposition of x 
and the sub-linearity of the 2-norm we can write 

\\G^ - BsG^L = max \\x^{G^ - BsG^)\\ (15) 



< max - BsG^)\\ + max \\z^{G^ - (16) 

At this point, we note that since y E SgV^e have y^Bg = y'^, which means that y^(G^ — BgG^) = 
0. Furthermore, z E Sj- means that z^B^ = 0. So the upper bound becomes 



\G' - BsG'\\^< max \\z'G'\\. (17) 

zes^ 



We can bound the maximum in ST71 in terms of Gs by 



\\z' G'\\i = z' {G' G-Gi Gs)z + z' Gi G^z < \\G' G - Gi G,\\f + oi^AGs), (18) 
where ol+i^M) = Afc+i(M), since ||z||2 = 1 and ||M||2 < \\M\\f. Combining ^ and (HHJ) 



gives us 

\\G^ — BsG^^ < o1^^{Gs 



s 



(19) 



Turning to Corollary 8.6.2 in [9, p. 449] we have for k = 1, . . . ,71^ 

Wk+iiG^G) - (Tfc+i(Gfa)| < WG'^G - GjGsh- (20) 

Using the bound for \\G'^G — G'^GsWf that we are given in (lU), and the relation ||M||2 < llMHi? 
for any matrix M we have 

\\G^G - Gj G sh < f{k)\\G\\l. (21) 

Thus Wk+i{G^G) - ak+i{G^Gs)\ = Wl^^{G) - al^,{Gs)\ < \\g^G - G^Gs\\^ < f{k) \\G\\l, 

which leads us to al_^-^^{Gs) < f{k) \\G\\p + al^^iiG). Combined with (O and (EB this yields 
that II (J — i?s)G'^||2 < Afc+i + 2f{k) \\G\fp, as was to be shown. 
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